Goto

Collaborating Authors

 statistical testing


C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

arXiv.org Artificial Intelligence

Questions in political interviews and hearings serve strategic purposes beyond information gathering including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: Interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th--117th Congress. Our analysis reveals systematic differences in questioning strategies across parties, by showing the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.


CDI: Copyrighted Data Identification in Diffusion Models

arXiv.org Artificial Intelligence

Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points which might all be included in the training of a given DM. By selectively aggregating signals from existing MIAs and using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as little as 70 data points to identify with a confidence of more than 99% whether their data was used to train a given DM. Thereby, CDI represents a valuable tool for data owners to claim illegitimate use of their copyrighted data.


Grounding Representation Similarity Through Statistical Testing

Neural Information Processing Systems

To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have \emph{sensitivity} to changes that affect functional behavior, and \emph{specificity} against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components.


The Perturbed Variation

Neural Information Processing Systems

We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score's capacity to detect similarity with that of other known measures on real data.


MMD-B-Fair: Learning Fair Representations with Statistical Testing

arXiv.org Artificial Intelligence

We introduce a method, MMD-B-Fair, to learn fair representations of data via kernel two-sample testing. We find neural features of our data where a maximum mean discrepancy (MMD) test cannot distinguish between representations of different sensitive groups, while preserving information about the target attributes. Minimizing the power of an MMD test is more difficult than maximizing it (as done in previous work), because the test threshold's complex behavior cannot be simply ignored. Our method exploits the simple asymptotics of block testing schemes to efficiently find fair representations without requiring complex adversarial optimization or generative modelling schemes widely used by existing work on fair representation learning. We evaluate our approach on various datasets, showing its ability to ``hide'' information about sensitive attributes, and its effectiveness in downstream transfer tasks.


What's the Difference Between a Metric and a Loss Function?

#artificialintelligence

Have you been using your loss function for evaluating your machine learning system's performance? That's a mistake, but don't worry, you're not alone. It's a widespread misunderstanding that may have something to do with software defaults, college course format, and decision-maker absenteeism in AI. In this article, I'll explain why you need two separate model scoring functions for evaluation and optimizationโ€ฆ and possibly a third one for statistical testing. Throughout data science, you'll see scoring functions (like the MSE, for example) being used for three main purposes: These three are subtly -- but importantly -- different from one another, so let's take a deeper look at what makes a function "good" for each purpose.


Interaction-Aware Sensitivity Analysis for Aerodynamic Optimization Results using Information Theory

arXiv.org Artificial Intelligence

An important issue during an engineering design process is to develop an understanding which design parameters have the most influence on the performance. Especially in the context of optimization approaches this knowledge is crucial in order to realize an efficient design process and achieve high-performing results. Information theory provides powerful tools to investigate these relationships because measures are model-free and thus also capture non-linear relationships, while requiring only minimal assumptions on the input data. We therefore propose to use recently introduced information-theoretic methods and estimation algorithms to find the most influential input parameters in optimization results. The proposed methods are in particular able to account for interactions between parameters, which are often neglected but may lead to redundant or synergistic contributions of multiple parameters. We demonstrate the application of these methods on optimization data from aerospace engineering, where we first identify the most relevant optimization parameters using a recently introduced information-theoretic feature-selection algorithm that accounts for interactions between parameters. Second, we use the novel partial information decomposition (PID) framework that allows to quantify redundant and synergistic contributions between selected parameters with respect to the optimization outcome to identify parameter interactions. We thus demonstrate the power of novel information-theoretic approaches in identifying relevant parameters in optimization runs and highlight how these methods avoid the selection of redundant parameters, while detecting interactions that result in synergistic contributions of multiple parameters.


Better than Average: Paired Evaluation of NLP Systems

arXiv.org Artificial Intelligence

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.


Statistical Testing for Efficient Out of Distribution Detection in Deep Neural Networks

arXiv.org Machine Learning

Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This presents a major concern for deployment in real-world applications, where such behavior may come at a great cost -- as in the case of autonomous vehicles or healthcare applications. This paper frames the Out Of Distribution (OOD) detection problem in DNN as a statistical hypothesis testing problem. Unlike previous OOD detection heuristics, our framework is guaranteed to maintain the false positive rate (detecting OOD as in-distribution) for test data. We build on this framework to suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better than state-of-the-art results on well-accepted OOD benchmarks without retraining the network parameters -- and at a fraction of the computational cost.


The Seven Sins of Machine Learning

#artificialintelligence

Machine learning is a great tool that is revolutionizing our world right now. There are lots of great applications in which machine and in particular deep learning has shown to be way superior to traditional methods. Beginning from Alex-Net for Image Classification to U-Net for Image Segmentation, we see great successes in computer vision and medical image processing. Still, I see machine learning methods fail every day. In many of these situations, people fell for one of the seven sins of machine learning.